Dynamic Topic Modelling of r/politics subreddit

This project provides tools to:

  1. Gather data from Reddit and save it in CSV format
  2. Clean and explore the gathered data
  3. Extract the main topics from the gathered data
  4. Visualise dynamic changes of topics over time

To extract topics, we use the BERTopic library, which performs topic modelling by clustering vector representations of documents. The main differences between BERTopic and other topic models are:

  1. High speed, achieved by reducing the dimensionality of the vector representations.
  2. A modular pipeline: the vectorization, dimensionality reduction and clustering stages are separated from each other, which lets you easily and quickly experiment with different combinations of algorithm settings.
  3. The pipeline consists of SOTA tools: SBERT, UMAP and HDBSCAN. Combined, these tend to produce better results than other models.

This project can be easily adapted to other data sources, which makes it possible to run a variety of experiments.

Import libraries

import os
from datetime import datetime
import time
from tqdm import tqdm
import pandas as pd
import spacy
import re

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance, KeyBERTInspired

from sentence_transformers import SentenceTransformer

# from umap import UMAP
from cuml import UMAP

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

# from hdbscan import HDBSCAN
from cuml.cluster.hdbscan import HDBSCAN

import plotly.io as pio

pio.renderers.default = "notebook+vscode+jupyterlab"

sns.set_theme(style="darkgrid")
# %config InlineBackend.figure_format = "retina"

# Dictionaries:
# en_core_web_sm
# en_core_web_md
# en_core_web_lg
# en_core_web_trf

nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"],
)

spacy_stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)

Load and clean data from CSV files

BERTopic uses Transformers. The model learns better when it receives more information from the text, so preprocessing is kept to a minimum.

Function to clean the text of URLs, markup and special symbols using regular expressions

def regex_preprocessing(text):
    # Remove URL
    text = re.sub(
        r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
        " ",
        text,
    )

    text = re.sub(
        r"\(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\)",
        " ",
        text,
    )

    # Remove special symbols
    text = re.sub(r"\n|\r|\<.*?\>|\{.*?\}|u/|\(.*emote.*\)|\[gif\]|/s|_", " ", text)
    text = re.sub(r"[^\w0-9'’“”%!?.,:*()><-]", " ", text)  # hyphen last so it is literal, not a range

    # Remove unnecessary brackets
    text = re.sub(r"\s\(\s", " ", text)

    # Delete unnecessary whitespaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()
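As a quick sanity check, the URL pattern above can be exercised on a made-up comment (the sample text is hypothetical):

```python
import re

# The URL pattern from regex_preprocessing, applied to a made-up comment
url_pattern = (
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

sample = "Read this https://example.com/post?id=1 before voting"
cleaned = re.sub(url_pattern, " ", sample)
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover whitespace
print(cleaned)  # -> 'Read this before voting'
```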

Function to load a CSV file into a dataframe, drop duplicate rows and apply the ‘regex_preprocessing’ function to the comments

def data_preprocessing(file_name):
    data = pd.read_csv(file_name)
    data_cleaned = data.drop_duplicates(keep=False).copy()  # copy to avoid SettingWithCopyWarning
    data_cleaned["comments"] = data_cleaned["comments"].apply(regex_preprocessing)

    return data_cleaned
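Note that keep=False is stricter than the default behaviour, as this toy example (made-up data) shows:

```python
import pandas as pd

# keep=False drops *every* row that has a duplicate,
# rather than keeping one copy of each repeated comment
df = pd.DataFrame({"comments": ["good point", "good point", "hard disagree"]})
print(df.drop_duplicates(keep=False)["comments"].tolist())  # -> ['hard disagree']
```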

Function to create a single dataframe with the cleaned data

This function consists of several steps:

  1. First, it collects the names of the CSV files in a chosen folder
  2. Second, it applies the ‘data_preprocessing’ function to each CSV to create dataframes with cleaned data
  3. Lastly, it concatenates them into a single combined dataframe

def process_data(directory):
    file_names = []
    for filename in os.listdir(directory):
        file = os.path.join(directory, filename)
        file_names.append(file)
    file_names.sort()

    dataframes = []
    for name in file_names:
        dataframes.append(data_preprocessing(name))

    cleaned_df = (
        pd.concat(dataframes)
        .drop(columns="time")
        .reset_index(drop=True)
        .drop_duplicates()
        .dropna()
    )

    return cleaned_df
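The file-gathering step above relies on lexicographic sorting, which gives chronological order when file names carry zero-padded date prefixes. A small sketch against a throwaway directory (the file names here are hypothetical):

```python
import os
import tempfile

# Mimic the file-gathering step of process_data in a temporary directory
with tempfile.TemporaryDirectory() as directory:
    for name in ["2024_02_01.csv", "2024_01_01.csv"]:
        open(os.path.join(directory, name), "w").close()
    # os.listdir returns names in arbitrary order, so we sort explicitly
    file_names = sorted(os.path.join(directory, f) for f in os.listdir(directory))
    print([os.path.basename(p) for p in file_names])  # -> ['2024_01_01.csv', '2024_02_01.csv']
```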

Apply data processing functions to gathered data

For this experiment, we load the CSVs with data marked as ‘hot’ by Reddit’s algorithms.

directory = "original_data/hot"
combined_df = process_data(directory)
len(combined_df["comments"].to_list())
264230

Convert the dataframe columns to lists for further work

comments = combined_df["comments"].to_list()
timestamps = combined_df["date"].to_list()

Create embeddings from cleaned data

The gte-small model was chosen using the Hugging Face MTEB benchmark; it is lightweight and works well with Reddit data.

# Pre-calculate embeddings
embedding_model = SentenceTransformer(
    model_name_or_path="thenlper/gte-small",
    cache_folder="transformers_cache",
)
embeddings = embedding_model.encode(comments, show_progress_bar=True)
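Encoding over 260k comments is slow, so it can be worth caching the result to disk and letting later runs skip the encode() step. A minimal sketch, with a hypothetical file name and a random stand-in array instead of the real embeddings:

```python
import numpy as np

# Stand-in for embedding_model.encode(...); gte-small outputs 384-dim vectors
emb = np.random.rand(4, 384).astype(np.float32)

# Persist once, then reload on later runs instead of re-encoding
np.save("embeddings_cache.npy", emb)
loaded = np.load("embeddings_cache.npy")
print(np.allclose(emb, loaded))  # -> True
```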

Plot data distribution

We use UMAP to reduce the dimensionality of the embeddings, which makes the data easier to cluster with HDBSCAN.

def plot_umap(embeddings, values):
    neighbors_list = values

    fig, axes = plt.subplots(2, 5, figsize=(27, 10), sharex=True, sharey=True)

    axes = axes.flatten()
    for ax, neighbors in tqdm(zip(axes, neighbors_list)):
        umap_model = UMAP(
            n_neighbors=neighbors, n_components=2, min_dist=0.0, metric="cosine"
        )
        # Apply UMAP to our data
        umap_result = umap_model.fit_transform(embeddings)
        # Visualise the results
        ax.scatter(
            umap_result[:, 0], umap_result[:, 1], alpha=0.15, c="orangered", s=0.1
        )
        ax.set_title(f"UMAP, n_neighbors = {neighbors}")
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")
    lim = 7
    plt.ylim(-lim, lim)
    plt.xlim(-lim, lim)
    plt.tight_layout()
    plt.show()


def plot_hdbscan(embeddings, umap_values, hdbscan_values):
    for n in umap_values:
        # Apply UMAP to our data
        umap_model = UMAP(n_neighbors=n, n_components=2, min_dist=0.0, metric="cosine")
        umap_result = umap_model.fit_transform(embeddings)

        # HDBSCAN
        sizes = hdbscan_values

        fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharex=True, sharey=True)

        axes = axes.flatten()
        for ax, size in tqdm(zip(axes, sizes)):
            # Cluster data with HDBSCAN
            hdbscan_model = HDBSCAN(
                min_cluster_size=size, metric="euclidean", prediction_data=True
            )
            hdbscan_labels = hdbscan_model.fit_predict(umap_result)
            # Create a dataframe with results of UMAP and HDBSCAN
            df = pd.DataFrame(
                umap_result, columns=[f"UMAP{i+1}" for i in range(2)]
            )
            df["Cluster"] = hdbscan_labels
            # scatterplot for results
            sns.scatterplot(
                x="UMAP1",
                y="UMAP2",
                hue="Cluster",
                data=df,
                palette="tab10",
                legend=None,
                linewidth=0,
                s=0.5,
                ax=ax,
            ).set_title(f"n_neighbors={n}, min_cluster_size={size}")
            ax.set_xlabel("Component 1")
            ax.set_ylabel("Component 2")
        lim = 7
        plt.ylim(-lim, lim)
        plt.xlim(-lim, lim)
        plt.tight_layout()
        plt.show()

We plot a range of n_neighbors values to see how the structure of the data changes: from a more local structure to a more global one.

plot_umap(embeddings, np.arange(10, 56, 5))
10it [01:24,  8.42s/it]

We can see the sizes of the clusters created with different parameter combinations.

plot_hdbscan(embeddings, [15, 20, 25], [15, 35, 50, 75])
4it [02:08, 32.09s/it]

4it [02:28, 37.09s/it]

4it [02:16, 34.07s/it]

Extract topics using BERTopic

In this work, we use the MaximalMarginalRelevance topic representation model, which reorders the words within each topic to remove semantic repetition and produce a sequence of the most significant words.
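The idea behind maximal marginal relevance can be sketched in a few lines. This is a toy illustration on made-up 2-D word vectors, not BERTopic's actual implementation: at each step we pick the candidate most relevant to the topic while penalising similarity to the words already chosen.

```python
import numpy as np

# Toy MMR sketch: balance topic relevance against redundancy with already-chosen words
def mmr(topic_vec, word_vecs, words, top_n=2, diversity=0.5):
    sim_topic = word_vecs @ topic_vec            # relevance of each word to the topic
    chosen = [int(np.argmax(sim_topic))]         # start with the most relevant word
    while len(chosen) < top_n:
        rest = [i for i in range(len(words)) if i not in chosen]
        # redundancy = highest similarity to any word we already picked
        redundancy = (word_vecs[rest] @ word_vecs[chosen].T).max(axis=1)
        scores = (1 - diversity) * sim_topic[rest] - diversity * redundancy
        chosen.append(rest[int(np.argmax(scores))])
    return [words[i] for i in chosen]

# Made-up unit vectors: "election"/"elections" nearly identical, "turnout" related but distinct
words = ["election", "elections", "turnout"]
vecs = np.array([[1.0, 0.0], [0.98, 0.2], [0.6, 0.8]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
topic = np.array([1.0, 0.5])
topic /= np.linalg.norm(topic)

print(mmr(topic, vecs, words))  # -> ['elections', 'turnout']
```

With diversity=0.5 the near-duplicate "election" is skipped in favour of "turnout", which is exactly the de-duplication effect we want in the topic word lists.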

We use CountVectorizer from scikit-learn to:

  1. Remove very rare and very frequent words from the final topic representations
  2. Create n-grams of up to 2 words
  3. Remove stopwords using the spaCy stopword list

Function for Topic Modelling Pipeline

def topic_modelling(n_neighbors, min_cluster_size):
    # UMAP init
    umap_model = UMAP(
        n_neighbors=n_neighbors, n_components=5, min_dist=0.0, metric="cosine"
    )

    # HDBSCAN init
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size, metric="euclidean", prediction_data=True
    )

    # Remove noise from created topics
    vectorizer_model = CountVectorizer(
        stop_words=spacy_stopwords, min_df=0.03, max_df=0.99, ngram_range=(1, 2)
    )

    # BERTopic model init
    representation_model = MaximalMarginalRelevance()
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
        verbose=True,
    )

    # Fit the model
    topics, probs = topic_model.fit_transform(comments, embeddings)

    # Get topics over time
    topics_over_time = topic_model.topics_over_time(
        comments,
        timestamps,
        datetime_format="%Y_%m_%d",
        global_tuning=True,
        evolution_tuning=True,
        nr_bins=20,
    )

    # Plot Topics over Time
    plot = topic_model.visualize_topics_over_time(
        topics_over_time, top_n_topics=15, height=700, width=1200
    )

    return topics, probs, topics_over_time, plot

Experiments

This part is purely experimental and requires a lot of time to tune the model's hyperparameters to get the best output. This is one of the main problems of topic modelling: there is no metric that tells us which hyperparameters are best, and the best modelling result may be subjective. That is why we run a series of experiments to obtain several results.

Generally, the hyperparameters should be chosen with several goals in mind:

  1. To preserve the local structure of the data after dimensionality reduction with UMAP.
  2. To reduce the amount of noise in the clusters and create a reasonable number of topics with HDBSCAN.
  3. To produce a list of understandable topics at the output.

UMAP n_neighbors = 15, HDBSCAN min_cluster_size = 35

topics, probs, topics_over_time, plot = topic_modelling(15, 35)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:43:52,525 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:44:13,632 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:44:13,635 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:44:00.052435] Transform can only be run with brute force. Using brute force.
2024-12-08 12:45:02,962 - BERTopic - Cluster - Completed ✓
2024-12-08 12:45:03,007 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:45:26,080 - BERTopic - Representation - Completed ✓
16it [03:11, 11.95s/it]

UMAP n_neighbors = 15, HDBSCAN min_cluster_size = 50

topics, probs, topics_over_time, plot = topic_modelling(15, 50)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:48:42,689 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:49:05,180 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:49:05,184 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:48:50.374168] Transform can only be run with brute force. Using brute force.
2024-12-08 12:49:53,314 - BERTopic - Cluster - Completed ✓
2024-12-08 12:49:53,349 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:50:11,762 - BERTopic - Representation - Completed ✓
16it [02:23,  8.97s/it]

UMAP n_neighbors = 15, HDBSCAN min_cluster_size = 75

topics, probs, topics_over_time, plot = topic_modelling(15, 75)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:52:40,079 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:53:02,998 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:53:03,002 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:52:47.786826] Transform can only be run with brute force. Using brute force.
2024-12-08 12:53:53,905 - BERTopic - Cluster - Completed ✓
2024-12-08 12:53:53,940 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:54:10,500 - BERTopic - Representation - Completed ✓
16it [01:41,  6.33s/it]

UMAP n_neighbors = 25, HDBSCAN min_cluster_size = 35

topics, probs, topics_over_time, plot = topic_modelling(25, 35)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 12:55:56,336 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 12:56:20,594 - BERTopic - Dimensionality - Completed ✓
2024-12-08 12:56:20,597 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [12:56:05.075884] Transform can only be run with brute force. Using brute force.
2024-12-08 12:57:10,667 - BERTopic - Cluster - Completed ✓
2024-12-08 12:57:10,701 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 12:57:32,180 - BERTopic - Representation - Completed ✓
16it [02:35,  9.71s/it]

UMAP n_neighbors = 25, HDBSCAN min_cluster_size = 50

topics, probs, topics_over_time, plot = topic_modelling(25, 50)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 13:00:12,350 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 13:00:33,530 - BERTopic - Dimensionality - Completed ✓
2024-12-08 13:00:33,534 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [13:00:20.726477] Transform can only be run with brute force. Using brute force.
2024-12-08 13:01:21,580 - BERTopic - Cluster - Completed ✓
2024-12-08 13:01:21,614 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 13:01:37,926 - BERTopic - Representation - Completed ✓
16it [01:56,  7.27s/it]

UMAP n_neighbors = 25, HDBSCAN min_cluster_size = 75

topics, probs, topics_over_time, plot = topic_modelling(25, 75)
pio.renderers.default = "notebook+vscode+jupyterlab"
plot
2024-12-08 13:03:38,968 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-08 13:04:02,000 - BERTopic - Dimensionality - Completed ✓
2024-12-08 13:04:02,004 - BERTopic - Cluster - Start clustering the reduced embeddings
[I] [13:03:47.728912] Transform can only be run with brute force. Using brute force.
2024-12-08 13:04:50,250 - BERTopic - Cluster - Completed ✓
2024-12-08 13:04:50,287 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-08 13:05:05,903 - BERTopic - Representation - Completed ✓
16it [01:31,  5.74s/it]